reasoning chain
Supplementary Materials for MEQA: A Benchmark for Multi-hop Event-centric Question Answering with Explanations
We utilize an open and widely used data format, i.e., JSON format, for the MEQA dataset. "context": "Roadside IED kills Russian major general [...]", # The context of the question "question": "Who died before AI-monitor reported it online?", "What event contains Al-Monitor is the communicator? "What event is after #1 has a victim? "Who died in the #2? major general,local commander,lieutenant general" We present a list of Datasheets [Gebru et al., 2021] for the MEQA dataset, synthesizing many of the For what purpose was the dataset created?
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.06)
- North America > United States > Texas (0.05)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
- Asia > Middle East > Syria (0.04)
- (18 more...)
- Government (1.00)
- Law (0.93)
- Leisure & Entertainment > Sports > Basketball (0.67)
- Law Enforcement & Public Safety (0.67)
- Law (1.00)
- Information Technology (1.00)
- Leisure & Entertainment > Games > Computer Games (0.46)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (2 more...)
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
Song, Maojia, Liu, Renhang, Wang, Xinyu, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhou, Jingren, Herremans, Dorien, Poria, Soujanya
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
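The factorised evaluation the abstract describes can be sketched as three separate scores rather than one pass rate. The per-example fields and the exact metric definitions below are assumptions in the spirit of the abstract, not WebDetective's actual formulas.

```python
# Hedged sketch of factorised evaluation: score search sufficiency,
# knowledge utilisation, and refusal behaviour separately instead of
# collapsing them into a single pass rate. Field names are assumptions.
def factorised_metrics(examples):
    searched = [e for e in examples if e["evidence_sufficient"]]
    unanswerable = [e for e in examples if not e["answerable"]]
    # Search sufficiency: how often the agent retrieved enough evidence.
    search_sufficiency = len(searched) / len(examples)
    # Knowledge utilisation: correctness conditioned on sufficient evidence.
    knowledge_use = (
        sum(e["answer_correct"] for e in searched) / len(searched) if searched else 0.0
    )
    # Appropriate refusal: refusing exactly when evidence cannot support an answer.
    refusal = (
        sum(e["refused"] for e in unanswerable) / len(unanswerable) if unanswerable else 0.0
    )
    return {
        "search_sufficiency": search_sufficiency,
        "knowledge_utilisation": knowledge_use,
        "appropriate_refusal": refusal,
    }
```

Separating the conditionals this way is what lets a benchmark distinguish a model that never finds the evidence from one that finds it but fails to use it.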
MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
Zhang, Jusheng, Cai, Kaitong, Guo, Xiaoyang, Liu, Sidi, Lv, Qinhan, Chen, Ruiqi, Yang, Jing, Fan, Yijia, Sun, Xiaofei, Wang, Jian, Chen, Ziliang, Lin, Liang, Wang, Keze
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
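Because each distractor violates exactly one of the two constraints, a selection error can be attributed to a specific failure mode. The item layout below is a hypothetical sketch of that diagnosis, not MM-CoT's released format.

```python
from collections import Counter

# Sketch of MM-CoT-style diagnosis, assuming each item records which
# constraint ("visual" or "logical") every distractor chain violates.
# The data layout is an assumption; the two constraints are from the abstract.
def diagnose(items, predictions):
    errors = Counter()
    correct = 0
    for item, pred in zip(items, predictions):
        if pred == item["gold"]:
            correct += 1
        else:
            # Attribute the failure to the constraint the chosen distractor violates.
            errors[item["violations"][pred]] += 1
    return correct / len(items), errors
```

Splitting accuracy into per-constraint error counts is what lets the benchmark say whether a model's failures are visual-grounding failures or logical-coherence failures.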
- Europe > Austria (0.28)
- Asia > Middle East > UAE (0.28)
Toward an AI Reasoning-Enabled System for Patient-Clinical Trial Matching
Leach, Caroline N., Klusty, Mitchell A., Armstrong, Samuel E., Pickarski, Justine C., Hankins, Kristen L., Collier, Emily B., Shah, Maya, Mullen, Aaron D., Bumgardner, V. K. Cody
Screening patients for clinical trial eligibility remains a manual, time-consuming, and resource-intensive process. We present a secure, scalable proof-of-concept system for Artificial Intelligence (AI)-augmented patient-trial matching that addresses key implementation challenges: integrating heterogeneous electronic health record (EHR) data, facilitating expert review, and maintaining rigorous security standards. Leveraging open-source, reasoning-enabled large language models (LLMs), the system moves beyond binary classification to generate structured eligibility assessments with interpretable reasoning chains that support human-in-the-loop review. This decision support tool represents eligibility as a dynamic state rather than a fixed determination, identifying matches when available and offering actionable recommendations that could render a patient eligible in the future. The system aims to reduce coordinator burden, intelligently broaden the set of trials considered for each patient, and guarantee comprehensive auditability of all AI-generated outputs.
Introduction: Applications of artificial intelligence (AI) in healthcare are increasingly focused on improving administrative efficiency and optimizing clinical workflows. Identifying relevant trials and screening them for a particular patient is traditionally manual, time-consuming, and heavily reliant on clinical expertise.
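The idea of eligibility as a dynamic state with actionable recommendations can be sketched as a small data structure. The class names, states, fields, and the trial identifier below are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass, field
from enum import Enum

# Sketch of "eligibility as a dynamic state rather than a fixed determination".
# All names here are hypothetical; only the concepts come from the abstract.
class EligibilityState(Enum):
    ELIGIBLE = "eligible"
    INELIGIBLE = "ineligible"
    POTENTIALLY_ELIGIBLE = "potentially_eligible"  # could qualify after a future action

@dataclass
class TrialAssessment:
    trial_id: str
    state: EligibilityState
    reasoning: list[str] = field(default_factory=list)        # interpretable reasoning chain
    recommendations: list[str] = field(default_factory=list)  # actions toward eligibility

# Example assessment with a placeholder trial identifier.
assessment = TrialAssessment(
    trial_id="NCT00000000",
    state=EligibilityState.POTENTIALLY_ELIGIBLE,
    reasoning=["Meets age criterion", "Missing recent HbA1c lab value"],
    recommendations=["Order HbA1c test within 90 days"],
)
```

Keeping the reasoning chain and the recommendations as explicit fields is what makes the assessment auditable and reviewable by a human in the loop.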
VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection
Nie, Yuzhou, Li, Hongwei, Guo, Chengquan, Jiang, Ruizhe, Wang, Zhun, Li, Bo, Song, Dawn, Guo, Wenbo
We propose VulnLLM-R, the~\emph{first specialized reasoning LLM} for vulnerability detection. Our key insight is that LLMs can reason about program states and analyze the potential vulnerabilities, rather than simple pattern matching. This can improve the model's generalizability and prevent learning shortcuts. However, SOTA reasoning LLMs are typically ultra-large, closed-source, or have limited performance in vulnerability detection. To address this, we propose a novel training recipe with specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. Using our proposed methodology, we train a reasoning model with seven billion parameters. Through extensive experiments on SOTA datasets across Python, C/C++, and Java, we show that VulnLLM-R achieves superior effectiveness and efficiency compared with SOTA static analysis tools and both open-source and commercial large reasoning models. We further conduct a detailed ablation study to validate the key designs in our training recipe. Finally, we construct an agent scaffold around our model and show that it outperforms CodeQL and AFL++ in real-world projects. Our agent further discovers a set of zero-day vulnerabilities in actively maintained repositories. This work represents a pioneering effort to enable real-world, project-level vulnerability detection using AI agents powered by specialized reasoning models. The code is available at~\href{https://github.com/ucsb-mlsec/VulnLLM-R}{github}.
- North America > United States > California (0.67)
- North America > United States > Illinois (0.46)
Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast
Truong, Nghi, Puranam, Phanish, Koçak, Özgecan
The emergence of transformative technologies often surfaces deep societal divisions, nowhere more evident than in contemporary debates about artificial intelligence (AI). A striking feature of these divisions is that they persist despite shared interests in ensuring that AI benefits humanity and avoiding catastrophic outcomes. This paper analyzes contemporary debates about AI risk, parsing the differences between the "doomer" and "boomer" perspectives into definitional, factual, causal, and moral premises to identify key points of contention. We find that differences in perspectives about existential risk ("X-risk") arise fundamentally from differences in causal premises about design vs. emergence in complex systems, while differences in perspectives about employment risks ("E-risks") pertain to different causal premises about the applicability of past theories (evolution) vs their inapplicability (revolution). Disagreements about these two forms of AI risk appear to share two properties: neither involves significant disagreements on moral values and both can be described in terms of differing views on the extent of boundedness of human rationality. Our approach to analyzing reasoning chains at scale, using an ensemble of LLMs to parse textual data, can be applied to identify key points of contention in debates about risk to the public in any arena.
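The "ensemble of LLMs" parsing step can be sketched as majority voting over premise-type labels. The labeller outputs here are stubbed strings; no real LLM calls are made, and the four premise types are the only detail taken from the abstract.

```python
from collections import Counter

# Sketch of ensemble aggregation: several LLM "parsers" each label a claim
# as one premise type, and a majority vote picks the consensus label.
# The voting scheme is an assumed simplification, not the paper's method.
PREMISE_TYPES = {"definitional", "factual", "causal", "moral"}

def majority_label(labels):
    """Return the most common premise-type label among ensemble votes."""
    assert set(labels) <= PREMISE_TYPES, "unknown premise type in votes"
    return Counter(labels).most_common(1)[0][0]
```

Aggregating independent labellers this way trades a single model's idiosyncrasies for the ensemble's consensus, which matters when parsing contentious debate text at scale.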
- North America > United States (0.93)
- North America > Canada > Ontario > Toronto (0.28)
- Europe > United Kingdom > England (0.28)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Law (1.00)
- Government (1.00)
- Information Technology > Security & Privacy (0.92)